The PC-SIG Library 10

home *** CD-ROM | disk | FTP | other *** search

/ The PC-SIG Library 10 / The PC-Sig Library - Shareware for the IBM PC and Compatibles (PC-SIG)(Tenth Edition Disks 1-2804)(1991).iso / PC_SIGCD / 07 / 3 / DISK0731.ZIP / CAFR next >

Wrap

Text File | 1987-02-09 | 11KB | 320 lines

Date: January 25, 1987 For more information: James E. Tarvid Logical Support State Road 700 at Peach Bottom Creek Route 2 Box 73 Independence, Virginia 24348 (703) 773-3419 ----------------------------------------------------------------------------- FILE RETRIEVAL USING INVERTED BIT VECTORS - James E. Tarvid ABSTRACT This paper describes a computer software technology for rapid retrieval of information from unstructured data such as collections of letters, resumes, abstracts, reports, or depositions residing on a mass storage device such as diskettes, hard disks, and compact disks using compact indices. The method uses a two step process of indexing and retrieval. The index structure employed is a high density inverted bit matrix indexed by word and file. A prototype for demonstrating the technology is available for personal computers. INTRODUCTION Most of the keystrokes eventually stored on computer systems comprise tokens or more simply words of identity, quality and quantity. The information contained therein depends on both the form and the content of the information structures and their interpretation by the end user. The expansion of computer usage has increased that proportion of stored information which is unstructured while the evolution of computer technology counters with ever greater yet still primitive understanding of information structures themselves. In spite of the advances in "artificial intelligence", we are years away from adequately automatically translating raw text into information structures. Only a small amount of the information entered on computers is available in a form suitable for retrieval by "fourth generation languages" (4GL). In the mean time, there is a need to support human activity in retrieving information from that vast bulk of unstructured information. INFORMATION STRUCTURE ASSUMPTIONS Unstructured information such as letters, depositions, resumes, reports and abstracts are generally stored as text files. From the primitive computer point of view, these files consists of one long string of characters. Without too much loss of information or functionality, we can consider the files to consist of words separated by punctuation and blank space. Obviously, the information contained depends on the juxtaposition of qualifiers and quantifiers, but for the purposes of the technology described herein, we shall be content with the notion that text is a sequence of words with no discernible significance in their order. COST FACTORS In computer parlance, we are predicating a structure of entities, which are files, and attributes, which are words (independent of the position of words within the file). More naturally we are talking about an index of files by word. Obviously, a more sophisticated structure is conceivable but we are responding here to economic needs and constraints of space and time. The index consists of entity-attribute couples which have as items a word and a file which contains that word. We can produce such an index by scanning files, producing file-word couples, sorting on word, eliminating duplicates and merging. Given such an index, it is easy to see how to quickly return a list of files containing a particular word. It is also easy to see that such an index would be much larger than the text itself. INFORMATION COMPRESSION Natural text is frequently compressed by the use of abbreviations. We commonly accept "USA" for the "United States of America" or "MD" for "medical doctor. Some technical compression efforts are natural and easy. For example, the identifier for a file can be reduced to a number if a corresponding structure maps that number back into a file name. That is we can use the number "5" for the file "xyz" if the fifth slot of a table contains "xyz". Now we can represent the index as a dictionary with a list of file numbers. The size of the dictionary is still rather large but efficiency improves as the number of files increases and in some cases the costs of the dictionary may well be within value of the index. Still to be generally useful, a much smaller index is needed. Less naturally, words can be converted into numbers by treating the characters as mathematical operands. At the simplest we can treat the word as a large binary integer. For text encoded in ASCII, the American Standard Code for Information Interchange", the range of values produced extends from 65 for the binary bit pattern for the word "A" into the billions for binary bit patterns for short words such as "name". The range can be compressed by restricting the domain to the 26 letters ignoring case, but the range of radix 26 number is still large and consequently not very dense. Fortunately, a technique known as hashing has been around for decades. Hashing words consists of converting the letters making up a word to a number and then "folding" the number into a limited range by dividing by a constant and taking the remainder. The hash code may be thought of as a mathematical abbreviation. There is a loss of specificity since two or more words may produce the same remainder much as an abbreviation may stand for two different words ("Dr" for "doctor" and "drive"). Such words are called "scramble synonyms". The loss of specificity, while troublesome in some regards, permits a much greater density of information in the index. At this point, the problem being addressed can be reduced a square matrix of code, file number pairs. Both of the matrix indices are dense and can be represented as a "bit" matrix where the row index is the hash codes and the column index is the file number and a bit indicates which files contain a word which generate the corresponding hash code. PROGRAM LOGIC Producing the index and using it become a simple matter. Indexing consists of reading each file to be indexed, hashing each word, and setting the bit in a vector corresponding to the hash code. Each file produces a separate vector. The set of vectors comprises a matrix which is transposed and written to disk. If the matrix is too large to fit in memory successive transposed matrices are merged. In a hierarchical directory system, the scope of files to be indexed can be presumed to be a directory node, its files, and all its sub nodes and their files. Unwanted files may be classified by extension or type and indexing of common words suppressed. Incremental indexing can be provided on file attributes such as "date". Retrieval consists of hashing each word in a "keyword list", retrieving the rows corresponding to the hash codes, "and-ing" the rows together, searching the row for set bits, and converting the index to a file name. Other boolean operators in the retrieval language such as "or" for synonyms are easy to implement. SCRAMBLE SYNONYM PROBLEM As noted before, the price of using a hash code to compress the index is that two or more words may have the same code. A retrieval specifying a single word is likely to include names of files which do not in fact contain the word. First, it should be noted that every file containing the specified word is returned by a retrieval. Second, the use of multiple keywords in a retrieval greatly diminishes the number of "false hits". Third, a post retrieval scan could be made. In fact, using inverted bit matrices before doing more complicated searches could greatly improve their efficiency as well. APPLICATIONS While the scope of applications amenable to this technique is extremely broad, it may be helpful to consider a few applications. Matching skills with requirements is common in large diverse organizations. Typically a list of commonly required skills is drawn up and each individual's skills are surveyed. The individual-skill record is inverted (sorted or indexed) on skill. From such an index, a list of individuals having a particular skill is easily produced. Several problems can arise. First, the survey may not include the requisite skills. It is difficult to anticipate all the skills that might be required in an organization. Second, a fair amount of effort is required to prepare the skills matrix and maintain it, hence it is likely to be less than complete or current in its contents. Now consider the case where the employment histories are on disk along with project summaries. After generating an index and periodically adding new files, a current list of individuals that may have the necessary attributes can be generated instantly. Word processing centers in organizations may produce thousands of documents. Typically, these documents are discarded or archived on diskette, which is tantamount to the same thing. A typical letter contains about 2000 characters and costs about $2 to type. The cost of permanent disk storage for such a letter is less than one penny yet virtually no one could can make a general query such as "has anyone corresponded with John Smith in Nigeria on the use of time controlled valves in drip irrigation?". The inverted index makes such a query and even more obscure queries simple. Corporate and institutional libraries have proven invaluable in support of professional activity. Eventually, much of the material is indexed or abstracted, and in the hands of a skilled librarian the needed references may make it into the hands of users but probably not in time for critical needs. Now consider the situation in which staff members are encouraged to make brief comments about significant material they encounter in routine operations. Collected on disk and indexed say once per week, any professional could instantly generate a list of references along with the judgments of their associates. Real estate brokers have combined with their colleagues through "multiple listing services" to match buyers with properties. Although the information management services provided are expensive and generally cumbersome, they provide little more information than property descriptions, buyer profiles, a PC with a hard disk, a word processing program, and an inverted index would provide at a fraction of the cost. PROTOTYPE A prototype of the system is available for MS-DOS and PC-DOS compatible systems at nominal cost. Firms, institutions, and individuals wishing to incorporate the technology in their systems and operations should write or call for details.